@wenhuach21
Contributor

@wenhuach21 wenhuach21 commented May 8, 2025

Background

This PR adds support for models quantized with AutoRound (GitHub, paper).

AutoRound delivers significantly higher accuracy at extremely low bit-widths (e.g., 2-bit) and offers broader compatibility across models (LLMs and VLMs), quantization formats, and configurations. You can check out our GitHub repository and paper, or this blog post.

AutoRound has been integrated into both pytorch/ao and Hugging Face Transformers. Several Hugging Face Spaces offer models quantized with AutoRound, including OPEA, Kaitchup, and fbaldassarri.

Known issues

Mixed-bit support is limited

Mixed-bit quantization is currently limited. Since vLLM fuses layers (e.g., QKV), applying different bit-widths to components within the same fused layer can lead to incompatibility issues.
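
To illustrate the constraint, here is a minimal sketch (not vLLM's actual code) of checking whether the projections that end up in one fused layer share a single bit-width. The function name `fused_bits`, the `layer_bits` mapping, the module names, and the 4-bit default are all hypothetical.

```python
def fused_bits(layer_bits: dict[str, int],
               shards: list[str],
               default_bits: int = 4) -> int:
    """Return the single bit-width shared by all shards of a fused layer.

    Raises ValueError if the shards disagree, because one fused weight
    cannot be packed with more than one bit-width.
    """
    # Shards without an explicit override use the model-wide default.
    bits = {layer_bits.get(name, default_bits) for name in shards}
    if len(bits) != 1:
        raise ValueError(
            f"mixed bit-widths {sorted(bits)} in fused layer: {shards}")
    return bits.pop()


# Hypothetical shard names for a fused QKV projection.
qkv = ["attn.q_proj", "attn.k_proj", "attn.v_proj"]
print(fused_bits({"attn.q_proj": 4}, qkv))
```

A config that sets only `attn.q_proj` to 2 bits while the others stay at 4 would raise here, which is the kind of incompatibility described above.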

Quantized MoE model support is limited

Qwen3-30B-A3B: fails with KeyError: 'layers.45.mlp.gate.qweight'. The GPTQ format has the same issue, while AWQ fails on assert self.quant_method is not None.

deepseek-moe-16b-base: fails with either "The input size is not aligned with the quantized weight shape" or "'MergedColumnParallelLinear' object has no attribute 'weight'". The same issues exist for AWQ and GPTQ.

Quantized VLM support is limited

Module names may differ from those in Transformers, which makes it risky to parse the quantization config correctly.

OPEA/Llama-3.2-11B-Vision-Instruct-int4-sym-inc: the Marlin kernel has issues, so it needs to fall back to the GPTQ kernel.

Qwen2.5-VL-7B: the auto_round:auto_gptq format fails with both the Marlin and GPTQ kernels, and a GPTQ model has a similar issue. The auto_round:auto_awq and AWQ formats work fine.

Signed-off-by: wenhuach21 <[email protected]>
@github-actions

github-actions bot commented May 8, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@wenhuach21 wenhuach21 marked this pull request as draft May 8, 2025 10:36
Signed-off-by: wenhuach21 <[email protected]>
@wenhuach21 wenhuach21 marked this pull request as ready for review May 8, 2025 10:49
Signed-off-by: wenhuach21 <[email protected]>
@wenhuach21
Contributor Author

Please review this when you are free.

Regarding the pre-commit check, the YAPF checker reformats many files that are unrelated to my PR. What should I do?

Member

@mgoin mgoin left a comment

Exciting! Are there examples you could add as smoke tests to validate that it works? Possibly to the CPU runner since there is an ipex quant backend, in addition to the gptq/awq forwarding methods

@wenhuach21
Contributor Author

Exciting! Are there examples you could add as smoke tests to validate that it works? Possibly to the CPU runner since there is an ipex quant backend, in addition to the gptq/awq forwarding methods

Thanks for the review. The unit tests will be added in the upcoming commits.

Signed-off-by: wenhuach21 <[email protected]>
Signed-off-by: wenhuach21 <[email protected]>
Signed-off-by: wenhuach21 <[email protected]>
@wenhuach21
Contributor Author

@mgoin The unit test has been added. I believe the test failure is not related to my PR. If it is, please kindly let me know, and I will fix it soon

@mgoin
Member

mgoin commented May 15, 2025

You have quite a few failing precommit tests

Error: vllm/model_executor/layers/quantization/auto_round.py:4:1: UP035 `typing.Dict` is deprecated, use `dict` instead
Error: vllm/model_executor/layers/quantization/auto_round.py:4:1: UP035 `typing.List` is deprecated, use `list` instead
Error: vllm/model_executor/layers/quantization/auto_round.py:38:53: UP006 Use `list` instead of `List` for type annotation
Error: vllm/model_executor/layers/quantization/auto_round.py:39:32: UP006 Use `dict` instead of `Dict` for type annotation
Error: vllm/model_executor/layers/quantization/auto_round.py:81:42: UP006 Use `list` instead of `List` for type annotation
Error: vllm/model_executor/layers/quantization/auto_round.py:89:38: UP006 Use `list` instead of `List` for type annotation
Error: vllm/model_executor/layers/quantization/auto_round.py:93:34: UP006 Use `dict` instead of `Dict` for type annotation
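
For context, Ruff rules UP035 and UP006 ask for the built-in generic types (available since Python 3.9) in place of `typing.Dict` and `typing.List`. A minimal sketch of that kind of change, using a hypothetical helper `pick` rather than the actual code in `auto_round.py`:

```python
from typing import Any, Optional

# Before (flagged by UP035/UP006):
#   from typing import Any, Dict, List, Optional
#   def pick(config: Dict[str, Any], keys: List[str]) -> Optional[Any]: ...


# After: built-in generics, so typing.Dict and typing.List are not imported.
def pick(config: dict[str, Any], keys: list[str]) -> Optional[Any]:
    """Return the value of the first key found in config, else None."""
    for key in keys:
        if key in config:
            return config[key]
    return None


print(pick({"bits": 4}, ["w_bits", "bits"]))
```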
diff --git a/tests/quantization/test_auto_round.py b/tests/quantization/test_auto_round.py
index 79bf731..81ceecd 100644
--- a/tests/quantization/test_auto_round.py
+++ b/tests/quantization/test_auto_round.py
@@ -18,7 +18,8 @@ MODELS = [
 
 
 @pytest.mark.skipif(not current_platform.is_cpu()
-                    and not current_platform.is_xpu() and not current_platform.is_cuda(),
+                    and not current_platform.is_xpu()
+                    and not current_platform.is_cuda(),
                     reason="only supports CPU/XPU/CUDA backend.")
 @pytest.mark.parametrize("model", MODELS)
 def test_auto_round(vllm_runner, model):
diff --git a/vllm/model_executor/layers/quantization/auto_round.py b/vllm/model_executor/layers/quantization/auto_round.py
index 52fa6cd..2967713 100644
--- a/vllm/model_executor/layers/quantization/auto_round.py
+++ b/vllm/model_executor/layers/quantization/auto_round.py
@@ -102,8 +102,8 @@ class AutoRoundConfig(QuantizationConfig):
                 None),
             extra_config=cls.get_from_keys_or(config, ["extra_config"], None),
             data_type=cls.get_from_keys_or(config, ["data_type"], "int"),
-            backend=cls.get_from_keys_or(config, ["backend",
-                                                  "vllm_backend"], "auto"),
+            backend=cls.get_from_keys_or(config, ["backend", "vllm_backend"],
+                                         "auto"),
         )
 
     def get_layer_config(self, layer, layer_name: str):
@@ -302,4 +302,3 @@ class AutoRoundConfig(QuantizationConfig):
             return self.apply_gptq_quant_layer(layer, prefix)
         if "awq" in self.packing_format or "awq" in self.backend:
             return self.apply_awq_quant_layer(layer, prefix)
-
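
The two hunks above touch the backend lookup and the format dispatch. As a rough, standalone sketch of that pattern (not the PR's implementation): `get_from_keys_or` is re-implemented here as a plain function, `select_path` is a hypothetical name, and the default packing format and the `"gptq"` condition are assumptions, since the diff shows only the `"awq"` branch.

```python
from typing import Any


def get_from_keys_or(config: dict[str, Any], keys: list[str],
                     default: Any) -> Any:
    """Return the value of the first key present in config, else default."""
    for key in keys:
        if key in config:
            return config[key]
    return default


def select_path(config: dict[str, Any]) -> str:
    """Pick a kernel family from the packing format or the backend string."""
    backend = get_from_keys_or(config, ["backend", "vllm_backend"], "auto")
    # Assumed default; the diff does not show how packing_format is read.
    packing_format = get_from_keys_or(config, ["packing_format"],
                                      "auto_round:auto_gptq")
    if "gptq" in packing_format or "gptq" in backend:
        return "gptq"
    if "awq" in packing_format or "awq" in backend:
        return "awq"
    return "unsupported"


print(select_path({"packing_format": "auto_round:auto_awq"}))
```

Accepting both `backend` and `vllm_backend` lets a checkpoint config carry either key, with `"auto"` as the fallback when neither is present.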

@wenhuach21
Contributor Author

wenhuach21 commented May 15, 2025

You have quite a few failing precommit tests

Yes, I just fixed it. I believe it was caused by a recent change in vLLM, either #17656 or other PRs. It was working fine before that.

Member

@mgoin mgoin left a comment

Seems reasonable to me otherwise, thanks!

@wenhuach21
Contributor Author

Seems reasonable to me otherwise, thanks!

Thanks for the review!

@wenhuach21
Contributor Author

@mgoin The buildkite/ci/pr job has been running for nearly a day and still hasn't completed. Is this expected?

Also, the pre-commit issue doesn't seem related to my PR, so I assume it's safe to ignore it, right?

@mgoin
Member

mgoin commented May 16, 2025

No, the precommit issue needs to be fixed, and it is not failing on main. To run the full CI I have to add the ready label, but I have not done so yet because of that issue. I can look at fixing it in a bit if you can't figure it out.

Signed-off-by: wenhuach21 <[email protected]>
Signed-off-by: wenhuach21 <[email protected]>
@wenhuach21
Contributor Author

No, the precommit issue needs to be fixed, and it is not failing on main. To run the full CI I have to add the ready label, but I have not done so yet because of that issue. I can look at fixing it in a bit if you can't figure it out.

Got it, thanks for the reply! I’ve figured out the root cause, and the pre-commit check now passes.

@mgoin mgoin added the ready label (ONLY add when PR is ready to merge/full CI is needed) May 16, 2025
Signed-off-by: wenhuach21 <[email protected]>
@wenhuach21
Contributor Author

wenhuach21 commented May 17, 2025

@mgoin the recent unit test failures don't appear to be related to this PR. Could you please help double-check? If that's the case, would it be possible to ignore them or provide some guidance on how to fix them?

FAILED quantization/test_bitsandbytes.py::test_load_8bit_bnb_model[meta-llama/Llama-Guard-3-8B-INT8-read pre-quantized llama 8-bit model

FAILED quantization/test_cpu_offload.py::test_cpu_offload_gptq - RuntimeError: Server exited unexpectedly

FAILED quantization/test_cpu_offload.py::test_cpu_offload_awq - RuntimeError: Server exited unexpectedly.

FAILED quantization/test_cpu_offload.py::test_cpu_offload_compressed_tensors - AssertionError: Results for model='nm-testing/llama7b-one-shot-2_4-w4a16-marlin24-t' are not the same

weight-loading-multiple-gpu test

  | [2025-05-17T05:04:50Z] =============================== warnings summary ===============================
  | [2025-05-17T05:04:50Z] ../../usr/local/lib/python3.12/dist-packages/schemathesis/generation/coverage.py:305
  | [2025-05-17T05:04:50Z]   /usr/local/lib/python3.12/dist-packages/schemathesis/generation/coverage.py:305: DeprecationWarning: jsonschema.exceptions.RefResolutionError is deprecated as of version 4.18.0. If you wish to catch potential reference resolution errors, directly catch referencing.exceptions.Unresolvable.
  | [2025-05-17T05:04:50Z]     ref_error: type[Exception] = jsonschema.RefResolutionError,
  | [2025-05-17T05:04:50Z]
  | [2025-05-17T05:04:50Z] -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
  | [2025-05-17T05:04:50Z] ======================== 1 skipped, 1 warning in 3.50s =========================
  | [2025-05-17T05:04:51Z] === PASSED MODEL: None, mgleize/fairseq2-dummy-Llama-3.2-1B, main ===
  | [2025-05-17T05:04:52Z] 🚨 Error: The command exited with status 1
  | [2025-05-17T05:04:52Z] user command error: The plugin docker command hook exited with status 1

@vllm-bot vllm-bot merged commit e2ee1e8 into vllm-project:main May 19, 2025
66 of 69 checks passed
@wenhuach21
Contributor Author

Thanks so much, @mgoin, for your kind review and support!

zzzyq pushed a commit to zzzyq/vllm that referenced this pull request May 24, 2025